Received: (from daemon@localhost) by services.bunyip.com (8.6.10/8.6.9) id PAA17697 for urn-ietf-out; Thu, 24 Oct 1996 15:15:51 -0400
Received: from mocha.bunyip.com (mocha.Bunyip.Com [192.197.208.1]) by services.bunyip.com (8.6.10/8.6.9) with SMTP id PAA17691 for <urn-ietf@services.bunyip.com>; Thu, 24 Oct 1996 15:15:47 -0400
Received: from josef.ifi.unizh.ch by mocha.bunyip.com with SMTP (5.65a/IDA-1.4.2b/CC-Guru-2b) id AA28846 (mail destined for urn-ietf@services.bunyip.com); Thu, 24 Oct 96 15:15:44 -0400
Received: from ifi.unizh.ch by josef.ifi.unizh.ch id <01391-0@josef.ifi.unizh.ch>; Thu, 24 Oct 1996 21:15:31 +0100
Subject: Re: [URN] Unicode for NSS query
To: tallen@fsc.fujitsu.com
Date: Thu, 24 Oct 1996 21:15:30 +0100 (MET)
Cc: rdaniel@acl.lanl.gov, urn-ietf@bunyip.com
In-Reply-To: <199610241725.KAA04756@ishtar.fsc.fujitsu.com> from "Terry Allen" at Oct 24, 96 10:25:11 am
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Content-Length: 6937
From: Martin J Duerst <mduerst@ifi.unizh.ch>
Message-Id: <"josef.ifi..192:24.09.96.20.15.32"@ifi.unizh.ch>
Sender: owner-urn-ietf@services.bunyip.com
Precedence: bulk
Reply-To: Martin J Duerst <mduerst@ifi.unizh.ch>
Errors-To: owner-urn-ietf@bunyip.com

Terry Allen wrote:

>Thanks to Patrik and Ron. I think I need to go out and buy
>those Unicode books today. Some clarifications:

Version 2.0 is out now, and it's only one book.
Addison-Wesley, urn:isbn:0-201-48345-9 :-).

>| > - why care? the NSS is supposed to be opaque
>| Which means? I think it means we can't go around inferring
>| structure in arbitrary namespaces. Opaque does not mean
>| that we have to take whatever comes. There can be particular
>| requirements on the characters allowed.
>| I think this WG should make
>| an explicit decision on trying to go to UNICODE rather than
>| just defaulting to ASCII. The reasons for doing so:
>| 1) The IAB workshop tells us to try to do so.
>| 2) Not all the namespaces in the world use the Latin alphabet
>| without accents.
>| 3) Isn't it time we start getting away from the US-ASCII assumption?
>
>I don't want to assume ASCII, but I was wondering, why think in terms
>of characters and not octets?

I have suggested that we might be required to think in terms of both
characters and octets. For some things, similar to a data: URL,
thinking in characters might be artificial. For some other things,
such as URLs, thinking in octets may to some extent be necessary
because of backwards compatibility issues (assume an URL scheme is
extended and decides to use some weird RFC 1522-like method for
encoding characters, and this would have to be grandfathered).

>If the NSS is thought of as a series
>of octets, and the meaning of those octets is defined by the scheme,
>the resolver should have no problem. You give one reason below
>(getting to the resolver). What else am I missing? (transcribability?)

Transcribability is definitely an issue. Assume there is some very
well established tradition

>| > - does this imply that a) NSSs should be formed originally
>| > in Unicode, or that b) NSSs in other coded character sets
>| > must be translated/transliterated into Unicode in forming
>| > URNs, or c) something else?
>|
>| What I assume happens is that namespaces are defined in terms of
>| glyphs, not coded character sets.

Terry here and below makes many comments about what we want to encode
not being glyphs but characters. I agree. I think Ron Daniel used the
term glyph with the meaning "character without assigned encoding".
To clear up terminology, a character by itself does not include a
code. It is something like a small logical text component. A glyph
(in standard terminology) is the appearance of a character. With
Latin letters, this distinction only turns up with e.g. "a" or "g",
where the "g" can be closed at the bottom or not, and these would be
two glyphs for the same character.
So replacing all the occurrences of "glyph" in Ron's comment by
"character", things become much more clear. It's then basically
a discussion of whether:
- Namespaces appear as characters, without specific encoding, or
- Namespaces appear as characters with their encoding already defined
and, for the latter, whether:
- They would have to be reencoded to Unicode
- They should be left as is

My personal opinion is that namespaces will appear in both variants,
both already encoded and not yet encoded, and also as encoded
entities without any relation to characters.

As for reencoding, I would propose to suggest in the document that
NSS define a reencoding if possible and appropriate, and do so if
possible by referring to existing documents (Unicode has mapping
tables between e.g. ISO 8859-6 (Latin/Arabic) and Unicode).

I would also suggest that the document propose that NSS or URL
schemes that have not yet defined character semantics beyond ASCII,
but that would like to do so, strongly consider doing so in
accordance with URN name syntax (i.e. using UTF-8).

>| >As an example, suppose that I have an existing name space,
>| >well ordered in every respect, in ISO 8859-6 (Latin/Arabic).
>| >I want this name space to be used for URNs, and (supposing
>| >that the coded character set is not an obstacle at this
>| >point) get it registered. As the upper half of 8859-6 is
>| >not a subset of Unicode, per the present syntax draft,

>All the *characters* of the upper half of 8859-6 can be represented
>by various combinations of Unicode glyphs, as I understand matters.

Yes. But all the ligatures and contextual forms are in the
compatibility section. The mapping is really straightforward
and well-defined.

>| Right. We have been assuming that the user enters a URN into
>| a browser using their local language/script/... and the browser
>| has the job of making it into UNICODE/UTF-8/%encoded.
>
>Terminological question: is the URN as entered a URN?
>or does it become a URN only when translated?

Interesting question. But really only terminology. I don't assume
anybody will publish an URN with %HH notation in e.g. a Japanese
newspaper. So even if we terminologically specify that only the
entered and translated URN is an URN, users will think differently.

>| Are there guidelines already in place on how, when a user wants
>| an "A" plus "grave" (or whatever), that is to be encoded? Is it
>| reasonable for us to cite such guidelines as the way namespaces
>| should be encoded? (Also, is it SHOULD be encoded or MUST be
>| encoded).
>
>That I can't answer (yet).

I have posted a question in this respect to the unicore (unicode
specialists) list. The answer is that:
- Unicode defines that A-Grave and A followed by combining Grave
  are equivalent.
- Unicode does not give any preference to any one of these forms.
I see the main reason for this state in the fact that up to now,
Unicode was mainly designed with text editing in mind, and it was
assumed that each text editor could convert incoming stuff to
whatever it preferred.

For things such as URNs, it would definitely be nice to have a set
of clear preferences. SHOULD be encoded would not help much, because
it would not take any burden from servers/resolvers (unless we want
to put the burden on the user, which in the case of A-grave would
be a really bad idea).

>| > Say
>| >I can have outcomes A, B, and C, all of them legitimate
>| >representations of my 8859-6 name in Unicode. Are
>| >urn:mynamespace:A, urn:mynamespace:B, and urn:mynamespace:C
>| >equivalent?
>|
>| I think that we can't mandate that they be lexically equivalent.
>| That requires FAR too much knowledge on the part of all URN software.
>| It may be that a resolver can be made smart enough that, when
>| dealing with a particular language and alphabet, it can recognize
>| "A with grave" as equivalent to "A" and "grave". I think that is outside
>| the bounds of what we standardize.
>
>Fine by me.
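
To make the A-grave case above concrete, here is a minimal sketch in
Python. It only illustrates the equivalence being discussed: the
helper name same_name is made up for this example, and the choice of
NFC as the form to compare in is an assumption, not something the
syntax draft or this thread settles.

```python
import unicodedata

# Two spellings of "A with grave": one precomposed character versus
# a base letter followed by a combining mark.
PRECOMPOSED = "\u00C0"   # LATIN CAPITAL LETTER A WITH GRAVE
DECOMPOSED = "A\u0300"   # "A" + COMBINING GRAVE ACCENT


def same_name(a: str, b: str, form: str = "NFC") -> bool:
    """Hypothetical helper: compare two strings after normalizing both.

    The normalization form is an assumption made for illustration.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


if __name__ == "__main__":
    # A plain code-point comparison treats the two spellings as different.
    print(PRECOMPOSED == DECOMPOSED)
    # After normalization they compare equal.
    print(same_name(PRECOMPOSED, DECOMPOSED))
    # The UTF-8 octets differ as well, so a resolver comparing raw
    # octets would see two different names.
    print(PRECOMPOSED.encode("utf-8").hex(), DECOMPOSED.encode("utf-8").hex())
```

Without an agreed preferred form, this comparison work falls on every
server/resolver, which is the burden mentioned above.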

I think that for Arabic and similar cases, with lots of compatibility
codes, it would be nice to say that compatibility codes SHOULD NOT
be used in URNs. This is not a problem of 8859-6 -> Unicode
conversion, but just of Arabic in general.

Regards,    Martin.
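
As a rough illustration of the chain discussed in this message (ISO
8859-6 octets, to Unicode characters, to UTF-8, to %HH notation), and
of what avoiding compatibility codes could mean in practice, here is
a small Python sketch. The function names are hypothetical, and
treating "has a tagged compatibility decomposition" as the test for a
compatibility code is an assumption made for this example.

```python
import unicodedata
from urllib.parse import quote


def nss_from_8859_6(octets: bytes) -> str:
    """Hypothetical helper: ISO 8859-6 octets -> UTF-8 -> %HH notation."""
    text = octets.decode("iso8859_6")             # octets -> characters
    return quote(text.encode("utf-8"), safe="")   # UTF-8 octets -> %HH


def has_compat_chars(text: str) -> bool:
    """True if any character carries a tagged (compatibility)
    decomposition such as "<isolated> 0627"."""
    return any(unicodedata.decomposition(ch).startswith("<") for ch in text)


if __name__ == "__main__":
    # 0xC7 is ARABIC LETTER ALEF in ISO 8859-6 (U+0627 in Unicode).
    print("urn:mynamespace:" + nss_from_8859_6(b"\xc7"))
    # The plain letter is not a compatibility code ...
    print(has_compat_chars("\u0627"))
    # ... but its isolated presentation form U+FE8D is.
    print(has_compat_chars("\ufe8d"))
```

Converting from 8859-6 never produces the presentation forms, which
matches the point above that the issue lies with Arabic in Unicode
in general and not with the conversion.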